The Fault, Dear Researchers, is not in Cranfield, But in our Metrics, that they are Unrealistic
Authors
Abstract
As designers of information retrieval (IR) systems, we need some way to measure the performance of our systems. An excellent approach is to directly measure actual user performance, either in situ or in the laboratory [12]. The downside of live user involvement is the prohibitive cost when many evaluations are required; for example, it is common practice to sweep parameter settings for ranking algorithms in order to optimize retrieval metrics on a test collection. The Cranfield approach to IR evaluation provides low-cost, reusable measures of system performance. Cranfield-style evaluation has frequently been criticized as being too divorced from the reality of how users search, but there really is nothing wrong with the approach [18]. The Cranfield approach is effectively a simulation of IR system usage that attempts to predict the performance of one system versus another [15]. As such, we should think of the Cranfield approach as the application of models to make predictions, which is common practice in science and engineering: physics has equations of motion, civil engineering has models of concrete strength, epidemiology has models of disease spread, and so on. In all of these fields, it is well understood that the models are simplifications of reality, but that they nonetheless provide the ability to make useful predictions. Information retrieval's predictive models are our evaluation metrics. The criticism of system-oriented IR evaluation should be redirected: the problem is not with Cranfield, which is just another name for making predictions given a model; the problem is with the metrics. We believe that rather than criticizing Cranfield, the correct response is to develop better metrics. We should make metrics that are more predictive of human performance, metrics that incorporate the user interface and realistically represent the variation in user behavior, and metrics that encapsulate our best understanding of search behavior. In popular parlance, we should bring solutions, not problems, to the system-oriented IR researcher. To this end, we have developed a new evaluation metric, time-biased gain (TBG), that predicts IR system performance in human terms: the expected number of relevant documents to be found by a user [16].
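The abstract only names TBG; as a rough illustration of the underlying idea, the sketch below computes a simplified time-biased gain for a single ranked list. It follows the general form described in [16], in which the gain earned at each rank is discounted by an exponentially decaying function of the expected time needed to reach that rank, D(t) = exp(-t ln 2 / h), with h a half-life parameter. The summary-scanning time, the linear reading-time model, the half-life value, and the binary gain used here are illustrative assumptions rather than the calibrated values from the paper.

```python
import math

def time_biased_gain(relevance, doc_lengths, half_life=224.0,
                     t_summary=4.4, a=0.018, b=7.8):
    """Simplified time-biased gain (TBG) for one ranked result list.

    relevance   : 0/1 relevance judgments, in rank order
    doc_lengths : document lengths in words, in rank order
    half_life   : decay half-life h in seconds (illustrative value)
    t_summary   : assumed seconds spent scanning each result summary
    a, b        : assumed linear reading-time model, a*words + b seconds

    All parameter values are placeholders for illustration; the original
    paper calibrates its model from user-study data.
    """
    tbg = 0.0
    elapsed = 0.0  # expected time T(k) to reach the current rank
    for rel, words in zip(relevance, doc_lengths):
        # Discount this rank's gain by how long the user took to get here.
        decay = math.exp(-elapsed * math.log(2) / half_life)
        gain = 1.0 if rel else 0.0  # binary gain; the paper uses expected gain
        tbg += gain * decay
        # Time spent at this rank: scan the summary, then read the document.
        elapsed += t_summary + (a * words + b)
    return tbg

# Example: relevant documents at ranks 1 and 3.
print(time_biased_gain([1, 0, 1, 0], [800, 1200, 600, 900]))
```

Under this model, a relevant document buried behind many long documents contributes almost nothing to the score, which is exactly the kind of human-time effect a metric like TBG is meant to capture.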
Similar resources
Evaluation of Classifiers in Software Fault-Proneness Prediction
The reliability of software depends on its fault-prone modules: the fewer fault-prone units a piece of software contains, the more we may trust it. Therefore, if we are able to predict the number of fault-prone modules in a piece of software, it will be possible to judge its reliability. In predicting software fault-prone modules, one of the contributing features is software metrics, by which one ...
Review of ranked-based and unranked-based metrics for determining the effectiveness of search engines
Purpose: Traditionally, there have been many metrics for evaluating search engines; nevertheless, various researchers have proposed new metrics in recent years. Awareness of these new metrics is essential for conducting research in the field of search engine evaluation. Thus, the purpose of this study was to provide an analysis of important and new metrics for evaluating search engines. Methodology: This is ...
When Coproduction Is Unproductive; Comment on “Experience of Health Leadership in Partnering with University-Based Researchers in Canada: A Call to ‘Re-Imagine’ Research”
Bowen et al. offer a sobering look at the reality of research partnerships from the decision-maker perspective. Health leaders who had actively engaged in such partnerships continued to describe research as irrelevant and unhelpful – just the problem that partnered research was intended to solve. This commentary further examines the many barriers that impede researchers ...
Misconduct in Research and Publication
Dear Editor, I read the recent publication on “Misconduct in Research and Publication” with great interest [1]. I agree that misconduct in research and publication is not uncommon. Nevertheless, it is rarely mentioned. In fact, there are many incorrect conceptions among researchers on publication ethics. The milder examples are attempts to report only the “positive outcomes” ...
Comment on Editorial; Best Research for Low Income Countries
Achieving a practical and productive balance in collaborative research between partners from high- and lower-income countries (North-South collaborations) requires seeking win-win solutions. This requires time for partners to engage with each other, to understand each participant’s research priorities, and to identify areas of mutual interest. In SACTRC’s experience, key elements include: building researc...
Publication date: 2012